knitr::opts_chunk$set(echo = TRUE)
require(dplyr)
require(ggplot2)
require(kableExtra)
require(reactable)
require(reactablefmtr)
require(CSGo)
require(readxl)
require(caret)
require(corrplot)
library(reshape2)
The goal of this project is to demonstrate some concepts that we learned in the “Regression Models” course applied to real data. For that purpose we are going to use a data set from House Prices in King County provided by the professor. We are going to explore this data set ether descriptively, and also testing some Regression Models to best predict the predict the house prices in King County.
This data set includes 21613 rows one for each combination of id and date (there are 175 ids repited twice and 1 id repited three times) and 21 columns. The data is from 2014-05-02 to 2015-05-14.
id: Identification number of the property (this variable is irrelevant for the analysis)
date: Date house was sold
price: Price (the prediction target)
bedrooms: Number of Bedrooms/House
bathrooms: Number of Bathrooms/Bedrooms
sqft living: square footage of the home
sqft lot: square footage of the lot
floors: Total floors (levels) in house
waterfront: House which has a view to a waterfront
view: Has been viewed
condition: How good the condition is (larger values mean better condition)
grade: overall grade given to the housing unit, based on King County grading system (larger grades are better)
sqft above: square footage of house apart from basement
sqft basement: square footage of the basement
yr built: Built Year
yr renovated: Year when house was renovated
zipcode: zip code
lat: Latitude coordinate
long: Longitude coordinate
sqft living15: Living room area in 2015 (implies? some renovations) This might or might not have affected the lotsize area
sqft lot15: lotSize area in 2015 (implies? some renovations)
Below it is possible to see a sample of the data:
As we don’t want to have temporal dependencies in our dataset, the very first step in order to clean it would be to remove the duplicates id overtime. So, for the repetitive ids we are going to consider only the most actual price. The code below presents this step.
The second step we are going to perform will be to divide our data set in train and test. The train data set will be used to the descriptive and also to train the regression models, it will contains 80% of our data. The test data set will be the remaining 20% data that we will use only to evaluate how our model will perform with unknown data.
require(caret)
# create Data Partition
set.seed(1992)
intrain <- createDataPartition(data$price, p = 0.8, list = FALSE)
db_train_raw <- data[intrain,]
db_test_raw <- data[-intrain,]
In order to understand and summarize our target variable data set we are going to perform some descriptive analysis in our train data set as follows.
library(GGally)
roundUp <- function(x) 10^ceiling(log10(x))
db_train = db_train_raw %>%
mutate(
bathrooms = as.numeric(bathrooms)
, floors = as.numeric(floors)
, waterfront = as.factor(as.numeric(as.factor(waterfront)) - 1)
, lat = as.numeric(lat)
, long = as.numeric(long)
) %>%
mutate(
lat = lat / (roundUp(abs(lat / 47))/10)
, long = long / (roundUp(abs(long / 120))/10)
) %>%
dplyr::select(-id, -date) %>%
data.frame()
Below it is presented some descriptive analysis for the continuous variables.
The average price is 5.4196302^{5} USD, the median is equal to 4.5^{5} years old, the oldest player is 7.7^{6} USD, the youngest player is 7.5^{4} USD, and the standard deviation is 3.6790934^{5} USD.
Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of price is asymmetric with a tail in the right.
require(ggplot2)
require(CSGo)
avg <- round(mean(db_train$price),2)
median <- round(median(db_train$price),2)
# histogram and density
db_train %>%
ggplot(aes(x = price)) +
geom_histogram(
aes(y = ..density..),
bins = 30,
fill = "#509997",
color = "black") +
geom_density(color = "#761954", size = 1) +
geom_vline(
xintercept = avg,
linetype = "dashed",
color = "#c62d64",
size = 1.5) +
geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
geom_vline(
xintercept = median,
linetype = "dotted",
color = "#87ca9d",
size = 1.5) +
geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
ggtitle("Distribution of player's age") +
CSGo::theme_csgo() +
ylab("Price (USD)") +
xlab("Density")
Lets also create a normal qqplot to see if the prices fits a normal distribution.
The normal qqplot compares the theoretical quantile from a normal distribution with the quantile from our sample data, in our case the price variable. If the price variable follows a normal distribution we expect to see most of the points above the reference line (Theoretical = Sample). However, based on the normal qqplot, we can assume that the price variable not follows a normal distribution at all.
The average sqft_living is 2083.5 USD, the median is equal to 1914 years old, the oldest player is 1.354^{4} USD, the youngest player is 370 USD, and the standard deviation is 919.96 USD.
Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_living is kind of symmetric but with some atypical values to the right.
avg <- round(mean(db_train$sqft_living),2)
median <- round(median(db_train$sqft_living),2)
# histogram and density
db_train %>%
ggplot(aes(x = sqft_living)) +
geom_histogram(
aes(y = ..density..),
bins = 30,
fill = "#509997",
color = "black") +
geom_density(color = "#761954", size = 1) +
geom_vline(
xintercept = avg,
linetype = "dashed",
color = "#c62d64",
size = 1.5) +
geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
geom_vline(
xintercept = median,
linetype = "dotted",
color = "#87ca9d",
size = 1.5) +
geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
ggtitle("Distribution of player's age") +
CSGo::theme_csgo() +
ylab("Sqft Living") +
xlab("Density")
The average sqft_lot is 1.480518^{4} USD, the median is equal to 7620 years old, the oldest player is 1.164794^{6} USD, the youngest player is 520 USD, and the standard deviation is 3.830935^{4} USD.
Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_lot is asymmetric with some atypical values to the right.
avg <- round(mean(db_train$sqft_lot),2)
median <- round(median(db_train$sqft_lot),2)
# histogram and density
db_train %>%
ggplot(aes(x = sqft_lot)) +
geom_histogram(
aes(y = ..density..),
bins = 30,
fill = "#509997",
color = "black") +
geom_density(color = "#761954", size = 1) +
geom_vline(
xintercept = avg,
linetype = "dashed",
color = "#c62d64",
size = 1.5) +
geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
geom_vline(
xintercept = median,
linetype = "dotted",
color = "#87ca9d",
size = 1.5) +
geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
ggtitle("Distribution of player's age") +
CSGo::theme_csgo() +
ylab("Sqft Lotg") +
xlab("Density")
The average sqft_above is 1791.08 USD, the median is equal to 1560 years old, the oldest player is 9410 USD, the youngest player is 370 USD, and the standard deviation is 828.65 USD.
Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_above is kind of symmetric but with some atypical values to the right.
avg <- round(mean(db_train$sqft_above),2)
median <- round(median(db_train$sqft_above),2)
# histogram and density
db_train %>%
ggplot(aes(x = sqft_above)) +
geom_histogram(
aes(y = ..density..),
bins = 30,
fill = "#509997",
color = "black") +
geom_density(color = "#761954", size = 1) +
geom_vline(
xintercept = avg,
linetype = "dashed",
color = "#c62d64",
size = 1.5) +
geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
geom_vline(
xintercept = median,
linetype = "dotted",
color = "#87ca9d",
size = 1.5) +
geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
ggtitle("Distribution of player's age") +
CSGo::theme_csgo() +
ylab("Sqft Above") +
xlab("Density")
The average sqft_basement is 292.41 USD, the median is equal to 0 years old, the oldest player is 4820 USD, the youngest player is 0 USD, and the standard deviation is 444.16 USD.
Below it is represented the histogram and the density of the variable. It is possible to see that the distribution of sqft_basement has a zero inflated, maybe it could be used as categorical variable indicating if the house has a basement or not, since we have in pour train dataset that 60.77% of the houses does not have a basement and 60.77% have.
avg <- round(mean(db_train$sqft_basement),2)
median <- round(median(db_train$sqft_basement),2)
# histogram and density
db_train %>%
ggplot(aes(x = sqft_basement)) +
geom_histogram(
aes(y = ..density..),
bins = 30,
fill = "#509997",
color = "black") +
geom_density(color = "#761954", size = 1) +
geom_vline(
xintercept = avg,
linetype = "dashed",
color = "#c62d64",
size = 1.5) +
geom_text(aes(avg, 0, label = paste("Average:", avg), hjust = -0.2, vjust = -1), color = "#c62d64") +
geom_vline(
xintercept = median,
linetype = "dotted",
color = "#87ca9d",
size = 1.5) +
geom_text(aes(median, 0, label = paste("Median:", median), hjust = 1.2, vjust = -1), color = "#87ca9d") +
ggtitle("Distribution of player's age") +
CSGo::theme_csgo() +
ylab("Sqft Basement") +
xlab("Density")
Below it is presented some descriptive analysis for the categorical variables against our target variable price.
First let’s take a look in the frequency table bellow.
# freq table
table(db_train$bedrooms)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 33
## 12 153 2183 7794 5473 1275 209 33 8 6 3 1 1
Based on the table above we have decided to joing the more than 8 in the same category.
Below is presented the boxplot of price by bedrooms.
db_train_linear <- db_train %>%
mutate(
bedrooms_cat = ifelse(bedrooms >=8, "more than 8", as.factor(bedrooms))
)
# bar plot
db_train_linear %>%
ggplot(aes(x = bedrooms_cat, y = price, fill = bedrooms_cat)) +
geom_boxplot(color = "black") +
ggtitle("Price by Number of bedrooms") +
ylab("Price") +
xlab("Number of bedrooms") +
labs(fill = "Number of bedrooms") +
CSGo::scale_fill_csgo()+
CSGo::theme_csgo()
The boxplot shows that the bigger the number of bedrooms bigger the price.
First let’s take a look in the frequency table bellow.
# freq table
table(db_train$bathrooms)
##
## 0 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4
## 9 3 57 3042 6 1140 2416 1546 1614 4282 943 587 470 587 129 111
## 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5 6.75 7.5 7.75 8
## 62 76 21 16 11 7 2 6 2 1 1 1 1 2
Based on the table above we have decided to joing the more than 4 in the same category and less than 1 in other category.
Below is presented the boxplot of price by bathrooms
db_train_linear <- db_train_linear %>%
mutate(
bathrooms = as.numeric(bathrooms),
bathrooms_cat = as.factor(case_when(
bathrooms >= 4 ~ "more than 4",
bathrooms < 1 ~ "0 less than 1",
TRUE ~ as.character(bathrooms))
))
# bar plot
db_train_linear %>%
ggplot(aes(x = bathrooms_cat, y = price, fill = bathrooms_cat)) +
geom_boxplot(color = "black") +
ggtitle("Price by Number of bathrooms") +
ylab("Price") +
xlab("Number of bathrooms") +
labs(fill = "Number of bathrooms") +
CSGo::scale_fill_csgo()+
CSGo::theme_csgo()
The boxplot shows that the bigger the number of bathrooms bigger the price.
First let’s take a look in the frequency table bellow.
# freq table
table(db_train$floors)
##
## 1 1.5 2 2.5 3 3.5
## 8448 1509 6569 130 490 5
Based on the table above we have decided to joing the more than 2 in the same category.
Below is presented the boxplot of price by floors
db_train_linear <- db_train_linear %>%
mutate(
floors_cat = as.factor(case_when(
floors >= 2 ~ "more than 2",
TRUE ~ as.character(floors))
))
# bar plot
db_train_linear %>%
ggplot(aes(x = floors_cat, y = price, fill = floors_cat)) +
geom_boxplot(color = "black") +
ggtitle("Price by Number of floors") +
ylab("Price") +
xlab("Number of floors") +
labs(fill = "Number of floors") +
CSGo::scale_fill_csgo()+
CSGo::theme_csgo()
The boxplot shows that the number of floors does not influential in the price.
First let’s take a look in the frequency table bellow.
# freq table
table(db_train$waterfront)
##
## 0 1
## 17018 133
Based on the table above can see that we have very unbalanced classes for the waterfront variable. The most of the data is FALSE. However the boxplot bellow suggests that when the waterfront is TRUE the price of the houses are bigger.
Below is presented the boxplot of price by waterfront.
# bar plot
db_train %>%
ggplot(aes(x = waterfront, y = price, fill = waterfront)) +
geom_boxplot(color = "black") +
ggtitle("Price by waterfront") +
ylab("Price") +
xlab("waterfronts") +
labs(fill = "waterfront") +
CSGo::scale_fill_csgo()+
CSGo::theme_csgo()
The boxplot shows when the waterfront is TRUE the higher will be the prices.
First let’s take a look in the frequency table bellow.
# freq table
table(db_train$view)
##
## 0 1 2 3 4
## 15450 272 770 407 252
Below is presented the boxplot of price by view.
db_train_linear <- db_train_linear %>%
mutate(
view = as.factor(view)
)
# bar plot
db_train_linear %>%
ggplot(aes(x = view, y = price, fill = view)) +
geom_boxplot(color = "black") +
ggtitle("Price by view") +
ylab("Price") +
xlab("view") +
labs(fill = "view") +
CSGo::scale_fill_csgo()+
CSGo::theme_csgo()
The boxplot shows that bigger the view bigger the price.
First let’s take a look in the frequency table bellow.
# freq table
table(db_train$condition)
##
## 1 2 3 4 5
## 20 129 11120 4526 1356
Based on the table above we decided to joing the categories 1 and 2 into the same category.
db_train_linear <- db_train_linear %>%
mutate(
condition_cat = ifelse(as.numeric(condition) <= 2, "less than 2", as.factor(condition))
)
# bar plot
db_train_linear %>%
ggplot(aes(x = condition_cat, y = price, fill = condition_cat)) +
geom_boxplot(color = "black") +
ggtitle("Price by condition") +
ylab("Price") +
xlab("condition") +
labs(fill = "condition") +
CSGo::scale_fill_csgo()+
CSGo::theme_csgo()
The boxplot shows that the condition has no influential in the price.
First let’s take a look in the frequency table bellow.
# freq table
table(db_train$grade)
##
## 3 4 5 6 7 8 9 10 11 12 13
## 3 20 183 1577 7147 4832 2075 921 310 71 12
Based on the table above we decided to joing the categories 3, 4 and 5 into the same category and the 11, 12, 13 in other category.
db_train_linear <- db_train_linear %>%
mutate(
grade = as.numeric(grade),
grade_cat = as.factor(case_when(
grade <= 5 ~ "less than 5",
grade >= 11 ~ "mores than 11",
TRUE ~ as.character(grade))
))
# bar plot
db_train_linear %>%
ggplot(aes(x = as.factor(grade_cat), y = price, fill = as.factor(grade_cat))) +
geom_boxplot(color = "black") +
ggtitle("Price by grade") +
ylab("Price") +
xlab("grade") +
labs(fill = "grade") +
CSGo::scale_fill_csgo()+
CSGo::theme_csgo()
The boxplot shows that bigger the grades bigger the price.
Now we are going to perform a correlation study to see if the are any continuous explanatory variables are related to each other and also the most correlated variables with the price.
require(corrplot)
db_train_contin <- db_train_linear %>%
select(
price,
sqft_living,
sqft_lot,
sqft_above,
sqft_basement
)
cor_result <- cor(db_train_contin, method = "pearson")
corrplot::corrplot.mixed(cor_result)
It is possible to see that sqft_living and sqft_above presented a high correlation (0.88), so to avoid multicolinearty problems we will not consider the sqft_above in our model, this choice is also based on that the sqft_living has the higher correlation to our target variable (price) compared to sqft_above.
pairs(data %>% select_if(is.numeric) %>% select(-sqft_above))
In this session we will train a linear regression to predict our target variable price. We will also run a step wise selection.
Let’s first define our start formula. We will not consider the following variables: sqft_above, condition, and floors.
# formula
form <- formula(price ~ sqft_living + sqft_basement + sqft_lot + grade_cat + view + waterfront + bathrooms_cat + bedrooms_cat)
Now let’s train the model using a linear regression method and select only the most significant predictors:
# model
#lmStepAIC
set.seed(1992)
mod <- train(
form,
data = db_train_linear,
method = "lmStepAIC",
trace=FALSE,
na.action = na.omit)
summary(mod)
##
## Call:
## lm(formula = .outcome ~ sqft_living + sqft_basement + sqft_lot +
## grade_cat6 + grade_cat7 + grade_cat8 + grade_cat9 + `grade_catless than 5` +
## `grade_catmores than 11` + view1 + view2 + view3 + view4 +
## waterfront1 + bathrooms_cat1 + bathrooms_cat1.5 + bathrooms_cat1.75 +
## bathrooms_cat2 + bathrooms_cat2.5 + bathrooms_cat2.75 + bathrooms_cat3.25 +
## bathrooms_cat3.75 + `bathrooms_catmore than 4` + bedrooms_cat2 +
## bedrooms_cat3 + bedrooms_cat4 + bedrooms_cat8 + `bedrooms_catmore than 8`,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1570917 -115050 -18995 91454 4535287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.253e+05 1.579e+04 26.930 < 2e-16 ***
## sqft_living 1.618e+02 4.090e+00 39.558 < 2e-16 ***
## sqft_basement 4.350e+01 4.731e+00 9.194 < 2e-16 ***
## sqft_lot -3.642e-01 4.546e-02 -8.011 1.21e-15 ***
## grade_cat6 -4.141e+05 1.202e+04 -34.455 < 2e-16 ***
## grade_cat7 -3.504e+05 9.969e+03 -35.149 < 2e-16 ***
## grade_cat8 -2.678e+05 9.253e+03 -28.942 < 2e-16 ***
## grade_cat9 -1.448e+05 9.122e+03 -15.879 < 2e-16 ***
## `grade_catless than 5` -4.389e+05 1.924e+04 -22.808 < 2e-16 ***
## `grade_catmores than 11` 3.156e+05 1.417e+04 22.266 < 2e-16 ***
## view1 1.350e+05 1.369e+04 9.857 < 2e-16 ***
## view2 7.958e+04 8.376e+03 9.500 < 2e-16 ***
## view3 1.303e+05 1.152e+04 11.307 < 2e-16 ***
## view4 3.191e+05 1.819e+04 17.539 < 2e-16 ***
## waterfront1 4.639e+05 2.444e+04 18.981 < 2e-16 ***
## bathrooms_cat1 5.906e+04 7.322e+03 8.066 7.72e-16 ***
## bathrooms_cat1.5 3.068e+04 8.273e+03 3.709 0.000209 ***
## bathrooms_cat1.75 2.143e+04 6.514e+03 3.290 0.001004 **
## bathrooms_cat2 2.874e+04 7.322e+03 3.925 8.69e-05 ***
## bathrooms_cat2.5 -4.953e+04 5.467e+03 -9.060 < 2e-16 ***
## bathrooms_cat2.75 -2.269e+04 8.389e+03 -2.704 0.006849 **
## bathrooms_cat3.25 5.470e+04 1.125e+04 4.862 1.17e-06 ***
## bathrooms_cat3.75 9.913e+04 2.031e+04 4.880 1.07e-06 ***
## `bathrooms_catmore than 4` 2.025e+05 1.477e+04 13.710 < 2e-16 ***
## bedrooms_cat2 6.884e+04 1.919e+04 3.587 0.000335 ***
## bedrooms_cat3 6.997e+04 6.770e+03 10.335 < 2e-16 ***
## bedrooms_cat4 2.252e+04 4.240e+03 5.311 1.10e-07 ***
## bedrooms_cat8 -6.858e+04 3.921e+04 -1.749 0.080349 .
## `bedrooms_catmore than 8` 1.007e+05 5.133e+04 1.962 0.049817 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221700 on 17122 degrees of freedom
## Multiple R-squared: 0.6373, Adjusted R-squared: 0.6367
## F-statistic: 1075 on 28 and 17122 DF, p-value: < 2.2e-16
So, the linear model will be:
mod$finalModel
##
## Call:
## lm(formula = .outcome ~ sqft_living + sqft_basement + sqft_lot +
## grade_cat6 + grade_cat7 + grade_cat8 + grade_cat9 + `grade_catless than 5` +
## `grade_catmores than 11` + view1 + view2 + view3 + view4 +
## waterfront1 + bathrooms_cat1 + bathrooms_cat1.5 + bathrooms_cat1.75 +
## bathrooms_cat2 + bathrooms_cat2.5 + bathrooms_cat2.75 + bathrooms_cat3.25 +
## bathrooms_cat3.75 + `bathrooms_catmore than 4` + bedrooms_cat2 +
## bedrooms_cat3 + bedrooms_cat4 + bedrooms_cat8 + `bedrooms_catmore than 8`,
## data = dat)
##
## Coefficients:
## (Intercept) sqft_living
## 4.253e+05 1.618e+02
## sqft_basement sqft_lot
## 4.350e+01 -3.642e-01
## grade_cat6 grade_cat7
## -4.141e+05 -3.504e+05
## grade_cat8 grade_cat9
## -2.678e+05 -1.448e+05
## `grade_catless than 5` `grade_catmores than 11`
## -4.389e+05 3.156e+05
## view1 view2
## 1.350e+05 7.958e+04
## view3 view4
## 1.303e+05 3.191e+05
## waterfront1 bathrooms_cat1
## 4.639e+05 5.906e+04
## bathrooms_cat1.5 bathrooms_cat1.75
## 3.068e+04 2.143e+04
## bathrooms_cat2 bathrooms_cat2.5
## 2.874e+04 -4.953e+04
## bathrooms_cat2.75 bathrooms_cat3.25
## -2.269e+04 5.470e+04
## bathrooms_cat3.75 `bathrooms_catmore than 4`
## 9.913e+04 2.025e+05
## bedrooms_cat2 bedrooms_cat3
## 6.884e+04 6.997e+04
## bedrooms_cat4 bedrooms_cat8
## 2.252e+04 -6.858e+04
## `bedrooms_catmore than 8`
## 1.007e+05
And below is presented the results of this model:
mod$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 223205.1 0.6310368 146329.2 4937.735 0.01012315 1407.002
In this session we are going to apply the model above in our test data set to see how it performs in a new data set.
First let’s transform some categories we created before in our test data set, and then apply our model to it.
db_test_linear <- db_test_raw %>%
mutate(
bedrooms_cat = ifelse(bedrooms >=8, "more than 8", as.factor(bedrooms)),
bathrooms = as.numeric(bathrooms),
view = as.factor(view),
waterfront = as.factor(as.numeric(as.factor(waterfront)) - 1),
bathrooms_cat = as.factor(case_when(
bathrooms >= 4 ~ "more than 4",
bathrooms < 1 ~ "0 less than 1",
TRUE ~ as.character(bathrooms))
),
grade = as.numeric(grade),
grade_cat = as.factor(case_when(
grade <= 5 ~ "less than 5",
grade >= 11 ~ "mores than 11",
TRUE ~ as.character(grade))
)
)
db_test_linear <- db_test_linear %>%
mutate(
pred = predict(mod, db_test_linear)
)
Now let’s plot the predicted vs observed.
db_test_linear %>% ggplot(aes(x = price, y = pred)) +
geom_point() +
CSGo::theme_csgo()
As we can see above, our lineal model tends to predict the prize of the property larger than the actual price. Ideally, the points should be around the diagonal and in this case most of them are above it.
The model developed to predict the price was considered satisfactory. However the use of generalized additive models could improve the results, as there are some predictors that they don’t have a linear relationship with the response.
As we saw in the previous step, we realized that the relationship between the predictors and the response variable, in many cases, is not linear. Therefore, we are going to using the Generalized Additive Models to see if this type of regression fits the data properly.
Let’s plot the densities function for each predictor at the very beginning:
db_train_raw %>%
dplyr::select(-id,-date) %>%
melt() %>%
ggplot(
aes(
x = value
, col = variable
)
) +
geom_density() +
facet_wrap(~variable, scales = "free")
There are some cases where some transformation are recommended, let’s plot the density functions after this:
db_train = db_train %>%
mutate(
sqft_living = log(sqft_living)
, sqft_lot = log(sqft_lot)
, sqft_above = log(sqft_above)
, sqft_living15 = log(sqft_living15)
, sqft_lot15 = log(sqft_lot15)
, has_basement = as.factor(ifelse(db_train$sqft_basement > 0, 1, 0))
)
db_train %>% summary()
## price bedrooms bathrooms sqft_living
## Min. : 75000 Min. : 0.000 Min. :0.000 Min. :5.914
## 1st Qu.: 324844 1st Qu.: 3.000 1st Qu.:1.750 1st Qu.:7.265
## Median : 450000 Median : 3.000 Median :2.250 Median :7.557
## Mean : 541963 Mean : 3.373 Mean :2.117 Mean :7.552
## 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.:7.844
## Max. :7700000 Max. :33.000 Max. :8.000 Max. :9.513
## sqft_lot floors waterfront view condition
## Min. : 6.254 Min. :1.000 0:17018 Min. :0.0000 Min. :1.000
## 1st Qu.: 8.524 1st Qu.:1.000 1: 133 1st Qu.:0.0000 1st Qu.:3.000
## Median : 8.939 Median :1.500 Median :0.0000 Median :3.000
## Mean : 8.987 Mean :1.496 Mean :0.2356 Mean :3.412
## 3rd Qu.: 9.273 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :13.968 Max. :3.500 Max. :4.0000 Max. :5.000
## grade sqft_above sqft_basement yr_built
## Min. : 3.000 Min. :5.914 Min. : 0.0 Min. :1900
## 1st Qu.: 7.000 1st Qu.:7.090 1st Qu.: 0.0 1st Qu.:1952
## Median : 7.000 Median :7.352 Median : 0.0 Median :1975
## Mean : 7.665 Mean :7.397 Mean : 292.4 Mean :1971
## 3rd Qu.: 8.000 3rd Qu.:7.701 3rd Qu.: 570.0 3rd Qu.:1997
## Max. :13.000 Max. :9.150 Max. :4820.0 Max. :2015
## yr_renovated zipcode lat long
## Min. : 0.00 Min. :98001 Min. :47.16 Min. :-122.5
## 1st Qu.: 0.00 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3
## Median : 0.00 Median :98065 Median :47.57 Median :-122.2
## Mean : 85.18 Mean :98078 Mean :47.56 Mean :-122.2
## 3rd Qu.: 0.00 3rd Qu.:98117 3rd Qu.:47.68 3rd Qu.:-122.1
## Max. :2015.00 Max. :98199 Max. :47.78 Max. :-121.3
## sqft_living15 sqft_lot15 has_basement
## Min. :6.131 Min. : 6.479 0:10423
## 1st Qu.:7.307 1st Qu.: 8.537 1: 6728
## Median :7.518 Median : 8.939
## Mean :7.541 Mean : 8.958
## 3rd Qu.:7.766 3rd Qu.: 9.219
## Max. :8.734 Max. :13.237
db_train %>%
melt() %>%
ggplot(
aes(
x = value
, col = variable
)
) +
geom_density() +
facet_wrap(~variable, scales = "free")
## Using waterfront, has_basement as id variables
As we can see in the next correlation plot, there are some variables that are high correlated. Thus, in order to prevent collinearity, we are going to remove some variables:
corrplot(
cor(db_train %>% select_if(is.numeric))
, order = "hclust"
)
#Remove highly correlated variables to prevent collinearity
corrplot(
cor(db_train %>% dplyr::select(-grade, -sqft_above, -sqft_lot, -sqft_living) %>% select_if(is.numeric))
, order = "hclust"
)
db_train = db_train %>% dplyr::select(-grade, -sqft_above, -sqft_lot, -sqft_living)
In the next plots, we are going to see relationship between the response and the covariates.
db_train %>%
select_if(is.numeric) %>%
melt(id = "price") %>%
ggplot(
aes(y = price, x = value)
) +
geom_point() +
geom_smooth() +
facet_wrap(~variable, scale = "free")
As we can see in the plots above, there is no linear relationship between price and the covariates.
Let’s train a model which provides a good behavior under the conditions mentioned above.
library(AICcmodavg)
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
library(survival)
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
library(visreg)
#Highest Correlated Variable
model.gam.liv = mgcv::gam(log(price) ~ s(sqft_living15), data = db_train)
summary(model.gam.liv)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.052216 0.003131 4169 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 7.865 8.635 1263 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.389 Deviance explained = 38.9%
## GCV = 0.16817 Scale est. = 0.16809 n = 17151
anova(model.gam.liv)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15)
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 7.865 8.635 1263 <2e-16
visreg(model.gam.liv, "sqft_living15")
model.gam.liv.bat = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms)
, data = db_train)
summary(model.gam.liv.bat)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.052216 0.002953 4420 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 6.922 7.960 575.9 <2e-16 ***
## s(bathrooms) 7.140 7.874 269.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.456 Deviance explained = 45.6%
## GCV = 0.14972 Scale est. = 0.14959 n = 17151
anova(model.gam.liv, model.gam.liv.bat)
## Analysis of Deviance Table
##
## Model 1: log(price) ~ s(sqft_living15)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms)
## Resid. Df Resid. Dev Df Deviance
## 1 17141 2881.3
## 2 17134 2563.3 7.1997 318.07
visreg(model.gam.liv.bat, "sqft_living15")
visreg(model.gam.liv.bat, "bathrooms")
Above we can see the estimated smooth effect s of sqft_living and bathrooms.
model.gam.liv.bat.view = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view
, data = db_train)
summary(model.gam.liv.bat.view)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.024686 0.003028 4301.98 <2e-16 ***
## view 0.116848 0.003952 29.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 7.215 8.193 462.8 <2e-16 ***
## s(bathrooms) 7.029 7.786 254.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.482 Deviance explained = 48.3%
## GCV = 0.14248 Scale est. = 0.14234 n = 17151
visreg(model.gam.liv.bat.view, "sqft_living15")
visreg(model.gam.liv.bat.view, "bathrooms")
anova(model.gam.liv.bat, model.gam.liv.bat.view)
## Analysis of Deviance Table
##
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view
## Resid. Df Resid. Dev Df Deviance
## 1 17134 2563.3
## 2 17133 2439.0 1.1449 124.28
rbind(
c("~living15", AIC(model.gam.liv))
, c("~living15 + bathrooms", AIC(model.gam.liv.bat))
, c("~living15 + bathrooms + view", AIC(model.gam.liv.bat.view))
)
## [,1] [,2]
## [1,] "~living15" "18098.268228551"
## [2,] "~living15 + bathrooms" "16104.5097962339"
## [3,] "~living15 + bathrooms + view" "15254.4711858212"
In the previous model, we have added the variable view as a predictor in the model.
model.gam.liv.bat.view.cbase = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
s(sqft_basement)
, data = db_train)
summary(model.gam.liv.bat.view.cbase)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.028720 0.003001 4342.0 <2e-16 ***
## view 0.099727 0.004020 24.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 7.029 8.049 482.81 <2e-16 ***
## s(bathrooms) 6.345 7.195 214.57 <2e-16 ***
## s(sqft_basement) 8.615 8.923 47.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.494 Deviance explained = 49.5%
## GCV = 0.13923 Scale est. = 0.13903 n = 17151
anova( model.gam.liv.bat.view, model.gam.liv.bat.view.cbase)
## Analysis of Deviance Table
##
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
## Resid. Df Resid. Dev Df Deviance
## 1 17133 2439.0
## 2 17125 2381.2 8.1873 57.829
model.gam.liv.bat.view.fbase = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
has_basement
, data = db_train)
summary(model.gam.liv.bat.view.fbase)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + has_basement
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.979134 0.003747 3464.16 <2e-16 ***
## view 0.102618 0.003969 25.85 <2e-16 ***
## has_basement1 0.124667 0.006171 20.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 7.03 8.049 506.1 <2e-16 ***
## s(bathrooms) 6.71 7.527 219.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.494 Deviance explained = 49.5%
## GCV = 0.13921 Scale est. = 0.13907 n = 17151
anova(model.gam.liv.bat.view.cbase, model.gam.liv.bat.view.fbase)
## Analysis of Deviance Table
##
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view + has_basement
## Resid. Df Resid. Dev Df Deviance
## 1 17125 2381.2
## 2 17132 2382.9 -7.5904 -1.7326
rbind(
c("~living15", AIC(model.gam.liv))
, c("~living15 + bathrooms", AIC(model.gam.liv.bat))
, c("~living15 + bathrooms + view", AIC(model.gam.liv.bat.view))
, c("~living15 + bathrooms + view + sqft_basement", AIC(model.gam.liv.bat.view.cbase))
, c("~living15 + bathrooms + view + has_basement", AIC(model.gam.liv.bat.view.fbase))
)
## [,1] [,2]
## [1,] "~living15" "18098.268228551"
## [2,] "~living15 + bathrooms" "16104.5097962339"
## [3,] "~living15 + bathrooms + view" "15254.4711858212"
## [4,] "~living15 + bathrooms + view + sqft_basement" "14858.4149482997"
## [5,] "~living15 + bathrooms + view + has_basement" "14856.3927235136"
model.gam.liv.bat.view.cbase.wtf = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
s(sqft_basement) +
waterfront
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.030081 0.002993 4353.34 <2e-16 ***
## view 0.081237 0.004355 18.65 <2e-16 ***
## waterfront1 0.386204 0.035660 10.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 6.994 8.020 492.18 <2e-16 ***
## s(bathrooms) 6.191 7.061 219.98 <2e-16 ***
## s(sqft_basement) 8.668 8.941 49.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.498 Deviance explained = 49.8%
## GCV = 0.1383 Scale est. = 0.13809 n = 17151
anova(model.gam.liv.bat.view.cbase, model.gam.liv.bat.view.cbase.wtf)
## Analysis of Deviance Table
##
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement)
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront
## Resid. Df Resid. Dev Df Deviance
## 1 17125 2381.2
## 2 17124 2365.0 0.85592 16.143
AIC(model.gam.liv.bat.view.cbase)
## [1] 14858.41
AIC(model.gam.liv.bat.view.cbase.wtf)
## [1] 14743.47
model.gam.liv.bat.view.cbase.wtf2 = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms, by = waterfront) +
view +
s(sqft_basement) +
waterfront
, data = db_train)
anova(model.gam.liv.bat.view.cbase.wtf,model.gam.liv.bat.view.cbase.wtf2)
## Analysis of Deviance Table
##
## Model 1: log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront
## Model 2: log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) +
## view + s(sqft_basement) + waterfront
## Resid. Df Resid. Dev Df Deviance
## 1 17124 2365
## 2 17123 2364 0.85106 1.0285
AIC(model.gam.liv.bat.view.cbase.wtf)
## [1] 14743.47
AIC(model.gam.liv.bat.view.cbase.wtf2)
## [1] 14737.69
model.gam.liv.bat.view.cbase.wtf2.bed = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms, by = waterfront) +
view +
s(sqft_basement) +
waterfront +
s(bedrooms)
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) +
## view + s(sqft_basement) + waterfront
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.030022 0.002993 4353.848 <2e-16 ***
## view 0.081420 0.004356 18.694 <2e-16 ***
## waterfront1 0.346587 0.038925 8.904 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 7.003 8.027 491.84 <2e-16 ***
## s(bathrooms):waterfront0 6.029 6.906 219.32 <2e-16 ***
## s(bathrooms):waterfront1 1.000 1.000 83.16 <2e-16 ***
## s(sqft_basement) 8.660 8.940 49.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.498 Deviance explained = 49.9%
## GCV = 0.13825 Scale est. = 0.13804 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf2.bed)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) +
## view + s(sqft_basement) + waterfront + s(bedrooms)
##
## Parametric Terms:
## df F p-value
## view 1 356.50 <2e-16
## waterfront 1 78.48 <2e-16
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 6.914 7.955 467.174 <2e-16
## s(bathrooms):waterfront0 6.823 7.588 171.551 <2e-16
## s(bathrooms):waterfront1 1.000 1.000 84.445 <2e-16
## s(sqft_basement) 8.637 8.932 45.153 <2e-16
## s(bedrooms) 7.963 8.482 8.387 <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2)
## [1] 14737.69
AIC(model.gam.liv.bat.view.cbase.wtf2.bed)
## [1] 14679.05
model.gam.liv.bat.view.cbase.wtf2.bed.flo = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms, by = waterfront) +
view +
s(sqft_basement, by = waterfront) +
waterfront +
s(bedrooms) +
floors
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) +
## view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) +
## floors
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.771877 0.010737 1189.49 <2e-16 ***
## view 0.078214 0.004293 18.22 <2e-16 ***
## waterfront1 0.354416 0.039911 8.88 <2e-16 ***
## floors 0.172952 0.006928 24.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 6.894 7.938 488.141 < 2e-16 ***
## s(bathrooms):waterfront0 7.079 7.783 75.885 < 2e-16 ***
## s(bathrooms):waterfront1 1.000 1.000 44.399 < 2e-16 ***
## s(sqft_basement):waterfront0 7.840 8.386 98.808 < 2e-16 ***
## s(sqft_basement):waterfront1 1.206 1.382 7.576 0.00197 **
## s(bedrooms) 8.101 8.587 8.428 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.517 Deviance explained = 51.8%
## GCV = 0.13296 Scale est. = 0.13268 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) +
## view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) +
## floors
##
## Parametric Terms:
## df F p-value
## view 1 331.90 <2e-16
## waterfront 1 78.86 <2e-16
## floors 1 623.14 <2e-16
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 6.894 7.938 488.141 < 2e-16
## s(bathrooms):waterfront0 7.079 7.783 75.885 < 2e-16
## s(bathrooms):waterfront1 1.000 1.000 44.399 < 2e-16
## s(sqft_basement):waterfront0 7.840 8.386 98.808 < 2e-16
## s(sqft_basement):waterfront1 1.206 1.382 7.576 0.00197
## s(bedrooms) 8.101 8.587 8.428 < 2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2.bed)
## [1] 14679.05
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
## [1] 14068.3
model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms, by = waterfront) +
view +
s(sqft_basement, by = waterfront) +
waterfront +
s(bedrooms) +
floors +
s(lat) +
s(long)
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) +
## view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) +
## floors + s(lat) + s(long)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.981817 0.007557 1717.812 < 2e-16 ***
## view 0.070976 0.002969 23.905 < 2e-16 ***
## waterfront1 0.486422 0.029114 16.708 < 2e-16 ***
## floors 0.033176 0.004878 6.801 1.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 4.409 5.485 1048.602 <2e-16 ***
## s(bathrooms):waterfront0 6.713 7.491 185.949 <2e-16 ***
## s(bathrooms):waterfront1 5.970 6.766 14.376 <2e-16 ***
## s(sqft_basement):waterfront0 8.906 8.993 18.196 <2e-16 ***
## s(sqft_basement):waterfront1 1.000 1.000 3.123 0.0772 .
## s(bedrooms) 5.582 6.436 49.819 <2e-16 ***
## s(lat) 8.942 8.999 2159.931 <2e-16 ***
## s(long) 8.872 8.991 82.008 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.78 Deviance explained = 78.1%
## GCV = 0.06064 Scale est. = 0.060448 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) +
## view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) +
## floors + s(lat) + s(long)
##
## Parametric Terms:
## df F p-value
## view 1 571.45 < 2e-16
## waterfront 1 279.15 < 2e-16
## floors 1 46.26 1.07e-11
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 4.409 5.485 1048.602 <2e-16
## s(bathrooms):waterfront0 6.713 7.491 185.949 <2e-16
## s(bathrooms):waterfront1 5.970 6.766 14.376 <2e-16
## s(sqft_basement):waterfront0 8.906 8.993 18.196 <2e-16
## s(sqft_basement):waterfront1 1.000 1.000 3.123 0.0772
## s(bedrooms) 5.582 6.436 49.819 <2e-16
## s(lat) 8.942 8.999 2159.931 <2e-16
## s(long) 8.872 8.991 82.008 <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo)
## [1] 14068.3
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
## [1] 603.4191
Adding Lat and Long decreased the AIC value from 14068 to 603 (96% less) and explained 27% of the deviance. And the interaction with waterfront doesnt seem to be relevant anymore.
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
s(sqft_basement) +
waterfront +
s(bedrooms) +
floors +
s(lat) +
s(long)
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.982527 0.007554 1718.554 < 2e-16 ***
## view 0.070884 0.002967 23.893 < 2e-16 ***
## waterfront1 0.495719 0.023912 20.731 < 2e-16 ***
## floors 0.032845 0.004877 6.735 1.69e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 4.362 5.432 1066.93 <2e-16 ***
## s(bathrooms) 7.249 7.950 179.68 <2e-16 ***
## s(sqft_basement) 4.512 5.477 27.34 <2e-16 ***
## s(bedrooms) 5.224 6.087 52.57 <2e-16 ***
## s(lat) 8.937 8.999 2161.40 <2e-16 ***
## s(long) 8.878 8.991 81.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.78 Deviance explained = 78%
## GCV = 0.060691 Scale est. = 0.060539 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long)
##
## Parametric Terms:
## df F p-value
## view 1 570.87 < 2e-16
## waterfront 1 429.78 < 2e-16
## floors 1 45.36 1.69e-11
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 4.362 5.432 1066.93 <2e-16
## s(bathrooms) 7.249 7.950 179.68 <2e-16
## s(sqft_basement) 4.512 5.477 27.34 <2e-16
## s(bedrooms) 5.224 6.087 52.57 <2e-16
## s(lat) 8.937 8.999 2161.40 <2e-16
## s(long) 8.878 8.991 81.47 <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo.loc)
## [1] 603.4191
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
## [1] 618.0408
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
s(sqft_basement) +
waterfront +
s(bedrooms) +
floors +
s(lat) +
s(long) +
s(zipcode)
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.992193 0.007528 1725.832 < 2e-16 ***
## view 0.071146 0.002940 24.201 < 2e-16 ***
## waterfront1 0.518459 0.023621 21.949 < 2e-16 ***
## floors 0.026226 0.004865 5.391 7.11e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.043 6.174 999.67 <2e-16 ***
## s(bathrooms) 7.290 7.977 183.47 <2e-16 ***
## s(sqft_basement) 5.245 6.226 20.44 <2e-16 ***
## s(bedrooms) 5.455 6.315 56.34 <2e-16 ***
## s(lat) 8.837 8.991 1685.72 <2e-16 ***
## s(long) 8.564 8.906 64.19 <2e-16 ***
## s(zipcode) 8.943 8.998 59.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.786 Deviance explained = 78.7%
## GCV = 0.058921 Scale est. = 0.058738 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode)
##
## Parametric Terms:
## df F p-value
## view 1 585.69 < 2e-16
## waterfront 1 481.77 < 2e-16
## floors 1 29.06 7.11e-08
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.043 6.174 999.67 <2e-16
## s(bathrooms) 7.290 7.977 183.47 <2e-16
## s(sqft_basement) 5.245 6.226 20.44 <2e-16
## s(bedrooms) 5.455 6.315 56.34 <2e-16
## s(lat) 8.837 8.991 1685.72 <2e-16
## s(long) 8.564 8.906 64.19 <2e-16
## s(zipcode) 8.943 8.998 59.53 <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc)
## [1] 618.0408
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
## [1] 110.2283
Adding zipcode reduced the AIC score from 618 to 110 (85% less)
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
s(sqft_basement) +
waterfront +
s(bedrooms) +
floors +
s(lat) +
s(long) +
s(zipcode) +
s(yr_built) +
s(yr_renovated)
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) +
## s(yr_built) + s(yr_renovated)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.979657 0.008480 1530.634 < 2e-16 ***
## view 0.068875 0.002920 23.589 < 2e-16 ***
## waterfront1 0.499333 0.023419 21.322 < 2e-16 ***
## floors 0.035060 0.005546 6.321 2.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.162 6.300 979.93 <2e-16 ***
## s(bathrooms) 7.649 8.255 181.04 <2e-16 ***
## s(sqft_basement) 4.333 5.288 24.95 <2e-16 ***
## s(bedrooms) 5.189 6.053 52.52 <2e-16 ***
## s(lat) 8.824 8.990 1581.43 <2e-16 ***
## s(long) 8.548 8.900 61.11 <2e-16 ***
## s(zipcode) 8.941 8.998 51.51 <2e-16 ***
## s(yr_built) 8.795 8.987 32.00 <2e-16 ***
## s(yr_renovated) 5.201 6.117 19.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.792 Deviance explained = 79.3%
## GCV = 0.057437 Scale est. = 0.057214 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) +
## s(yr_built) + s(yr_renovated)
##
## Parametric Terms:
## df F p-value
## view 1 556.43 < 2e-16
## waterfront 1 454.61 < 2e-16
## floors 1 39.96 2.66e-10
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.162 6.300 979.93 <2e-16
## s(bathrooms) 7.649 8.255 181.04 <2e-16
## s(sqft_basement) 4.333 5.288 24.95 <2e-16
## s(bedrooms) 5.189 6.053 52.52 <2e-16
## s(lat) 8.824 8.990 1581.43 <2e-16
## s(long) 8.548 8.900 61.11 <2e-16
## s(zipcode) 8.941 8.998 51.51 <2e-16
## s(yr_built) 8.795 8.987 32.00 <2e-16
## s(yr_renovated) 5.201 6.117 19.10 <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip)
## [1] 110.2283
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
## [1] -327.4509
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
s(sqft_basement) +
waterfront +
s(bedrooms) +
floors +
s(lat) +
s(long) +
s(zipcode) +
s(yr_built) +
s(yr_renovated) +
condition
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) +
## s(yr_built) + s(yr_renovated) + condition
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.737756 0.013572 938.518 < 2e-16 ***
## view 0.068774 0.002877 23.904 < 2e-16 ***
## waterfront1 0.498074 0.023075 21.585 < 2e-16 ***
## floors 0.030017 0.005469 5.489 4.11e-08 ***
## condition 0.073115 0.003233 22.618 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.212 6.352 1029.79 <2e-16 ***
## s(bathrooms) 7.316 8.002 172.08 <2e-16 ***
## s(sqft_basement) 4.384 5.343 17.66 <2e-16 ***
## s(bedrooms) 5.061 5.924 52.79 <2e-16 ***
## s(lat) 8.831 8.991 1658.24 <2e-16 ***
## s(long) 8.598 8.919 58.69 <2e-16 ***
## s(zipcode) 8.945 8.998 50.31 <2e-16 ***
## s(yr_built) 8.602 8.953 23.32 <2e-16 ***
## s(yr_renovated) 4.881 5.764 40.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.798 Deviance explained = 79.9%
## GCV = 0.055775 Scale est. = 0.055557 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) +
## s(yr_built) + s(yr_renovated) + condition
##
## Parametric Terms:
## df F p-value
## view 1 571.40 < 2e-16
## waterfront 1 465.91 < 2e-16
## floors 1 30.13 4.11e-08
## condition 1 511.56 < 2e-16
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.212 6.352 1029.79 <2e-16
## s(bathrooms) 7.316 8.002 172.08 <2e-16
## s(sqft_basement) 4.384 5.343 17.66 <2e-16
## s(bedrooms) 5.061 5.924 52.79 <2e-16
## s(lat) 8.831 8.991 1658.24 <2e-16
## s(long) 8.598 8.919 58.69 <2e-16
## s(zipcode) 8.945 8.998 50.31 <2e-16
## s(yr_built) 8.602 8.953 23.32 <2e-16
## s(yr_renovated) 4.881 5.764 40.25 <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr)
## [1] -327.4509
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
## [1] -831.1318
model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot = mgcv::gam(
log(price) ~ s(sqft_living15) +
s(bathrooms) +
view +
s(sqft_basement) +
waterfront +
s(bedrooms) +
floors +
s(lat) +
s(long) +
s(zipcode) +
s(yr_built) +
s(yr_renovated) +
condition +
s(sqft_lot15)
, data = db_train)
summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) +
## s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.678685 0.013373 948.09 <2e-16 ***
## view 0.067461 0.002782 24.25 <2e-16 ***
## waterfront1 0.471993 0.022330 21.14 <2e-16 ***
## floors 0.069845 0.005551 12.58 <2e-16 ***
## condition 0.073112 0.003117 23.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.718 6.868 571.97 <2e-16 ***
## s(bathrooms) 6.932 7.709 159.04 <2e-16 ***
## s(sqft_basement) 5.319 6.303 22.83 <2e-16 ***
## s(bedrooms) 8.504 8.853 25.79 <2e-16 ***
## s(lat) 8.777 8.985 1854.81 <2e-16 ***
## s(long) 8.907 8.995 71.03 <2e-16 ***
## s(zipcode) 8.951 8.999 52.16 <2e-16 ***
## s(yr_built) 8.349 8.879 47.85 <2e-16 ***
## s(yr_renovated) 4.669 5.525 44.31 <2e-16 ***
## s(sqft_lot15) 8.339 8.865 148.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.812 Deviance explained = 81.3%
## GCV = 0.051849 Scale est. = 0.051609 n = 17151
anova(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) +
## s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15)
##
## Parametric Terms:
## df F p-value
## view 1 588.0 <2e-16
## waterfront 1 446.8 <2e-16
## floors 1 158.3 <2e-16
## condition 1 550.0 <2e-16
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.718 6.868 571.97 <2e-16
## s(bathrooms) 6.932 7.709 159.04 <2e-16
## s(sqft_basement) 5.319 6.303 22.83 <2e-16
## s(bedrooms) 8.504 8.853 25.79 <2e-16
## s(lat) 8.777 8.985 1854.81 <2e-16
## s(long) 8.907 8.995 71.03 <2e-16
## s(zipcode) 8.951 8.999 52.16 <2e-16
## s(yr_built) 8.349 8.879 47.85 <2e-16
## s(yr_renovated) 4.669 5.525 44.31 <2e-16
## s(sqft_lot15) 8.339 8.865 148.07 <2e-16
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond)
## [1] -831.1318
AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
## [1] -2083.016
We are going to show the summary of the AIC procedure we have followed:
library(formula.tools)
paste(model.gam.liv$formula %>% as.character(), "|AIC:", AIC(model.gam.liv))
## [1] "log(price) ~ s(sqft_living15) |AIC: 18098.268228551"
paste(model.gam.liv.bat$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) |AIC: 16104.5097962339"
paste(model.gam.liv.bat.view$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view |AIC: 15254.4711858212"
paste(model.gam.liv.bat.view.cbase$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) |AIC: 14858.4149482997"
paste(model.gam.liv.bat.view.cbase.wtf$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront |AIC: 14743.4713138703"
paste(model.gam.liv.bat.view.cbase.wtf2$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf2))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + view + s(sqft_basement) + waterfront |AIC: 14737.6882662845"
paste(model.gam.liv.bat.view.cbase.wtf2.bed$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf2.bed))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + view + s(sqft_basement) + waterfront + s(bedrooms) |AIC: 14679.0533318368"
paste(model.gam.liv.bat.view.cbase.wtf2.bed.flo$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf2.bed.flo))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms, by = waterfront) + view + s(sqft_basement, by = waterfront) + waterfront + s(bedrooms) + floors |AIC: 14068.3004361546"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) |AIC: 618.040750878333"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) |AIC: 110.228331770422"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + s(yr_built) + s(yr_renovated) |AIC: -327.45089088778"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + s(yr_built) + s(yr_renovated) + condition |AIC: -831.131766486816"
paste(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot$formula %>% as.character(), "|AIC:", AIC(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot))
## [1] "log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) + waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) + s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15) |AIC: -2083.01626285448"
db_test = db_test_raw %>%
mutate(
bathrooms = as.numeric(bathrooms)
, floors = as.numeric(floors)
, waterfront = as.factor(as.numeric(as.factor(waterfront)) - 1)
, lat = as.numeric(lat)
, long = as.numeric(long)
) %>%
mutate(
lat = lat / (roundUp(abs(lat / 47))/10)
, long = long / (roundUp(abs(long / 120))/10)
) %>%
dplyr::select(-id, -date) %>%
data.frame()
db_test = db_test %>%
mutate(
sqft_living = log(sqft_living)
, sqft_lot = log(sqft_lot)
, sqft_above = log(sqft_above)
, sqft_living15 = log(sqft_living15)
, sqft_lot15 = log(sqft_lot15)
)
db_test$prediction = predict(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot, db_test)
db_test %>% ggplot(aes(x = log(price), y = prediction)) +
geom_point() +
CSGo::theme_csgo()
This model predicts the price of the property really well as we can see in the scatter plot showed above. The points generated from the actual value and the predicted are around the diagonal, being a really good sign.
This would be our best hedonic pricing model!
summary(model.gam.liv.bat.view.cbase.wtf.bed.flo.loc.zip.yr.cond.lot)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(price) ~ s(sqft_living15) + s(bathrooms) + view + s(sqft_basement) +
## waterfront + s(bedrooms) + floors + s(lat) + s(long) + s(zipcode) +
## s(yr_built) + s(yr_renovated) + condition + s(sqft_lot15)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.678685 0.013373 948.09 <2e-16 ***
## view 0.067461 0.002782 24.25 <2e-16 ***
## waterfront1 0.471993 0.022330 21.14 <2e-16 ***
## floors 0.069845 0.005551 12.58 <2e-16 ***
## condition 0.073112 0.003117 23.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(sqft_living15) 5.718 6.868 571.97 <2e-16 ***
## s(bathrooms) 6.932 7.709 159.04 <2e-16 ***
## s(sqft_basement) 5.319 6.303 22.83 <2e-16 ***
## s(bedrooms) 8.504 8.853 25.79 <2e-16 ***
## s(lat) 8.777 8.985 1854.81 <2e-16 ***
## s(long) 8.907 8.995 71.03 <2e-16 ***
## s(zipcode) 8.951 8.999 52.16 <2e-16 ***
## s(yr_built) 8.349 8.879 47.85 <2e-16 ***
## s(yr_renovated) 4.669 5.525 44.31 <2e-16 ***
## s(sqft_lot15) 8.339 8.865 148.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.812 Deviance explained = 81.3%
## GCV = 0.051849 Scale est. = 0.051609 n = 17151